The first step in data modeling and machine learning is to understand your data. This step determines which methods are used in the rest of the analytics process. Whether a clear output variable can be identified, what data type that variable has, and on what scale it is measured are the key questions to address during exploratory data analysis. Visualizing the data comes first in the entire data analytics process.
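As a minimal sketch of these first questions, the base R snippet below inspects the type and scale of individual variables. It uses the built-in `iris` data as a stand-in (an assumption for illustration), and the helper name `describe_variable` is hypothetical. Note that data alone cannot tell interval from ratio scale, so the helper reports both.

```r
# Hypothetical helper: report the type and measurement scale of one variable
describe_variable <- function(x) {
  if (is.factor(x) || is.character(x)) {
    list(type = "categorical",
         scale = if (is.ordered(x)) "ordinal" else "nominal",
         values = length(unique(x)))        # number of distinct categories
  } else if (is.numeric(x)) {
    list(type = "numeric",
         scale = "interval/ratio",          # cannot be told apart from data alone
         values = range(x, na.rm = TRUE))   # minimum and maximum
  } else {
    list(type = class(x)[1], scale = "unknown", values = NA)
  }
}

str(iris)                                   # overall structure of the data frame
species_info <- describe_variable(iris$Species)
sepal_info   <- describe_variable(iris$Sepal.Length)
```

A categorical output variable points toward classification-style methods and bar charts, while a numeric one points toward regression-style methods and histograms.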
John Tukey (1977) remarks that the over-emphasis on statistical significance, or the hypothesis confirmation process, neglects the other important part of data analysis, which Garrett Grolemund and Hadley Wickham term hypothesis generation. Tukey suggests that the purpose of Exploratory Data Analysis (EDA) is to suggest hypotheses. Confusing the two types of analysis and employing them on the same set of data can lead to systematic bias, owing to the issues inherent in testing hypotheses suggested by the data.
“An approximate answer to the right question is worth a great deal more than a precise answer to the wrong question” - John W. Tukey
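One common way to keep hypothesis generation and hypothesis confirmation apart is to explore on one part of the data and test on another. The sketch below is an illustration under stated assumptions, not a method prescribed by the authors cited here: it uses the built-in `mtcars` data, an arbitrary seed, and a simple half-and-half random split.

```r
# Assumption: a random half of the rows is used for exploration (hypothesis
# generation) and the other half is held back for confirmation (testing).
set.seed(2024)                        # arbitrary seed, for reproducibility only
n <- nrow(mtcars)
explore_idx <- sample(seq_len(n), size = floor(n / 2))

explore <- mtcars[explore_idx, ]      # look at plots and form hypotheses here
confirm <- mtcars[-explore_idx, ]     # test those hypotheses here

# Exploration: a correlation (or a plot) suggests a hypothesis about wt and mpg
cor(explore$wt, explore$mpg)

# Confirmation: test the hypothesis on rows that did not suggest it
confirm_test <- cor.test(confirm$wt, confirm$mpg)
confirm_test$p.value
```

Because the confirmation rows played no part in suggesting the hypothesis, the test is not biased by the exploration step.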
John T. Behrens lists the objectives of EDA for researchers to:
Grolemund and Wickham describe the EDA process as an iterative cycle:
At the end of the process, EDA leads to a decision about which methods to adopt in the next stage.
Type
Scale
Chart thought starter
Univariate
Groups
Bivariate or Multivariate Relationship
Time series
Matrix
Scatter plot matrix
Ensemble plot
Univariate: 1. Frequency table (descr::freq()) 2. Histogram (graphics::hist()) 3. Bar chart 4. Pie chart 5. Area chart
Bivariate/Multivariate:
Qualitative (Groups/Categorical): 1. Bar chart 2. Line chart
Quantitative (Continuous/Numeric): 1. Scatter plot 2. Bubble plot
Time series: 1. Trend/line time series plot
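The chart guide above can be sketched as a small lookup function. This is only a thought starter in code form: the function name `suggest_chart` and its rules are hypothetical simplifications of the list, not part of any package.

```r
# Hypothetical helper: map the type of one or two variables to a starting chart
suggest_chart <- function(x, y = NULL) {
  is_cat <- function(v) is.factor(v) || is.character(v) || is.logical(v)
  # A date or date-time on the x-axis with a second variable suggests a time series
  if (inherits(x, c("Date", "POSIXct")) && !is.null(y)) {
    return("line (time series) plot")
  }
  # One variable only: categorical -> bar chart, numeric -> histogram
  if (is.null(y)) {
    return(if (is_cat(x)) "bar chart / frequency table" else "histogram")
  }
  # Two variables: any categorical one suggests grouped charts
  if (is_cat(x) || is_cat(y)) "grouped bar chart or line chart" else "scatter plot"
}

suggest_chart(iris$Species)
suggest_chart(iris$Sepal.Length)
suggest_chart(iris$Sepal.Length, iris$Petal.Length)
suggest_chart(as.Date("2020-01-01") + 0:2, 1:3)
```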
## Gentle Machine Learning
## Exploratory Data Analysis
## Adapted from Grolemund, Garrett, and Hadley Wickham. 2018
## R for data science. Ch.7 (https://r4ds.had.co.nz/).
# install.packages("tidyverse")
library(tidyverse)
# Plot diamonds data
attach(diamonds)
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut)) +
  theme_bw()
# Simple table
# %>% is the forward pipe operator: it passes the left-hand side to the next function
diamonds %>%
  count(cut)
## # A tibble: 5 x 2
## cut n
## <ord> <int>
## 1 Fair 1610
## 2 Good 4906
## 3 Very Good 12082
## 4 Premium 13791
## 5 Ideal 21551
# Frequency table with chart
# install.packages("descr")
library(descr)
freq(diamonds$cut)
## diamonds$cut
## Frequency Percent Cum Percent
## Fair 1610 2.985 2.985
## Good 4906 9.095 12.080
## Very Good 12082 22.399 34.479
## Premium 13791 25.567 60.046
## Ideal 21551 39.954 100.000
## Total 53940 100.000
library(RColorBrewer)
# What is the carat variable?
descr(diamonds$carat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2000 0.4000 0.7000 0.7979 1.0400 5.0100
# A histogram divides the x-axis into equally spaced bins and then uses
# the height of a bar to display the number of observations in each bin.
hist(diamonds$carat)
# Another look
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5) +
  theme_bw() # Why does this look different from hist(carat)?
# Yet another look
freq(diamonds$carat)
## diamonds$carat
## Frequency Percent
## 0.2 12 2.225e-02
## 0.21 9 1.669e-02
## 0.22 5 9.270e-03
## 0.23 293 5.432e-01
## 0.24 254 4.709e-01
## 0.25 212 3.930e-01
## 0.26 253 4.690e-01
## 0.27 233 4.320e-01
## 0.28 198 3.671e-01
## 0.29 130 2.410e-01
## 0.3 2604 4.828e+00
## 0.31 2249 4.169e+00
## 0.32 1840 3.411e+00
## 0.33 1189 2.204e+00
## 0.34 910 1.687e+00
## 0.35 667 1.237e+00
## 0.36 572 1.060e+00
## 0.37 394 7.304e-01
## 0.38 670 1.242e+00
## 0.39 398 7.379e-01
## 0.4 1299 2.408e+00
## 0.41 1382 2.562e+00
## 0.42 706 1.309e+00
## 0.43 488 9.047e-01
## 0.44 212 3.930e-01
## 0.45 110 2.039e-01
## 0.46 178 3.300e-01
## 0.47 99 1.835e-01
## 0.48 63 1.168e-01
## 0.49 45 8.343e-02
## 0.5 1258 2.332e+00
## 0.51 1127 2.089e+00
## 0.52 817 1.515e+00
## 0.53 709 1.314e+00
## 0.54 625 1.159e+00
## 0.55 496 9.195e-01
## 0.56 492 9.121e-01
## 0.57 430 7.972e-01
## 0.58 310 5.747e-01
## 0.59 282 5.228e-01
## 0.6 228 4.227e-01
## 0.61 204 3.782e-01
## 0.62 135 2.503e-01
## 0.63 102 1.891e-01
## 0.64 80 1.483e-01
## 0.65 65 1.205e-01
## 0.66 48 8.899e-02
## 0.67 48 8.899e-02
## 0.68 25 4.635e-02
## 0.69 26 4.820e-02
## 0.7 1981 3.673e+00
## 0.71 1294 2.399e+00
## 0.72 764 1.416e+00
## 0.73 492 9.121e-01
## 0.74 322 5.970e-01
## 0.75 249 4.616e-01
## 0.76 251 4.653e-01
## 0.77 251 4.653e-01
## 0.78 187 3.467e-01
## 0.79 155 2.874e-01
## 0.8 284 5.265e-01
## 0.81 200 3.708e-01
## 0.82 140 2.595e-01
## 0.83 131 2.429e-01
## 0.84 64 1.187e-01
## 0.85 62 1.149e-01
## 0.86 34 6.303e-02
## 0.87 31 5.747e-02
## 0.88 23 4.264e-02
## 0.89 21 3.893e-02
## 0.9 1485 2.753e+00
## 0.91 570 1.057e+00
## 0.92 226 4.190e-01
## 0.93 142 2.633e-01
## 0.94 59 1.094e-01
## 0.95 65 1.205e-01
## 0.96 103 1.910e-01
## 0.97 59 1.094e-01
## 0.98 31 5.747e-02
## 0.99 23 4.264e-02
## 1 1558 2.888e+00
## 1.01 2242 4.156e+00
## 1.02 883 1.637e+00
## 1.03 523 9.696e-01
## 1.04 475 8.806e-01
## 1.05 361 6.693e-01
## 1.06 373 6.915e-01
## 1.07 342 6.340e-01
## 1.08 246 4.561e-01
## 1.09 287 5.321e-01
## 1.1 278 5.154e-01
## 1.11 308 5.710e-01
## 1.12 251 4.653e-01
## 1.13 246 4.561e-01
## 1.14 207 3.838e-01
## 1.15 149 2.762e-01
## 1.16 172 3.189e-01
## 1.17 110 2.039e-01
## 1.18 123 2.280e-01
## 1.19 126 2.336e-01
## 1.2 645 1.196e+00
## 1.21 473 8.769e-01
## 1.22 300 5.562e-01
## 1.23 279 5.172e-01
## 1.24 236 4.375e-01
## 1.25 187 3.467e-01
## 1.26 146 2.707e-01
## 1.27 134 2.484e-01
## 1.28 106 1.965e-01
## 1.29 101 1.872e-01
## 1.3 122 2.262e-01
## 1.31 133 2.466e-01
## 1.32 89 1.650e-01
## 1.33 87 1.613e-01
## 1.34 68 1.261e-01
## 1.35 77 1.428e-01
## 1.36 50 9.270e-02
## 1.37 46 8.528e-02
## 1.38 26 4.820e-02
## 1.39 36 6.674e-02
## 1.4 50 9.270e-02
## 1.41 40 7.416e-02
## 1.42 25 4.635e-02
## 1.43 19 3.522e-02
## 1.44 18 3.337e-02
## 1.45 15 2.781e-02
## 1.46 18 3.337e-02
## 1.47 21 3.893e-02
## 1.48 7 1.298e-02
## 1.49 11 2.039e-02
## 1.5 793 1.470e+00
## 1.51 807 1.496e+00
## 1.52 381 7.063e-01
## 1.53 220 4.079e-01
## 1.54 174 3.226e-01
## 1.55 124 2.299e-01
## 1.56 109 2.021e-01
## 1.57 106 1.965e-01
## 1.58 89 1.650e-01
## 1.59 89 1.650e-01
## 1.6 95 1.761e-01
## 1.61 64 1.187e-01
## 1.62 61 1.131e-01
## 1.63 50 9.270e-02
## 1.64 43 7.972e-02
## 1.65 32 5.933e-02
## 1.66 30 5.562e-02
## 1.67 25 4.635e-02
## 1.68 19 3.522e-02
## 1.69 24 4.449e-02
## 1.7 215 3.986e-01
## 1.71 119 2.206e-01
## 1.72 57 1.057e-01
## 1.73 52 9.640e-02
## 1.74 40 7.416e-02
## 1.75 50 9.270e-02
## 1.76 28 5.191e-02
## 1.77 17 3.152e-02
## 1.78 12 2.225e-02
## 1.79 15 2.781e-02
## 1.8 21 3.893e-02
## 1.81 9 1.669e-02
## 1.82 13 2.410e-02
## 1.83 18 3.337e-02
## 1.84 4 7.416e-03
## 1.85 3 5.562e-03
## 1.86 9 1.669e-02
## 1.87 7 1.298e-02
## 1.88 4 7.416e-03
## 1.89 4 7.416e-03
## 1.9 7 1.298e-02
## 1.91 12 2.225e-02
## 1.92 2 3.708e-03
## 1.93 6 1.112e-02
## 1.94 3 5.562e-03
## 1.95 3 5.562e-03
## 1.96 4 7.416e-03
## 1.97 4 7.416e-03
## 1.98 5 9.270e-03
## 1.99 3 5.562e-03
## 2 265 4.913e-01
## 2.01 440 8.157e-01
## 2.02 177 3.281e-01
## 2.03 122 2.262e-01
## 2.04 86 1.594e-01
## 2.05 67 1.242e-01
## 2.06 60 1.112e-01
## 2.07 50 9.270e-02
## 2.08 41 7.601e-02
## 2.09 45 8.343e-02
## 2.1 52 9.640e-02
## 2.11 43 7.972e-02
## 2.12 25 4.635e-02
## 2.13 21 3.893e-02
## 2.14 48 8.899e-02
## 2.15 22 4.079e-02
## 2.16 25 4.635e-02
## 2.17 18 3.337e-02
## 2.18 31 5.747e-02
## 2.19 22 4.079e-02
## 2.2 32 5.933e-02
## 2.21 23 4.264e-02
## 2.22 27 5.006e-02
## 2.23 13 2.410e-02
## 2.24 16 2.966e-02
## 2.25 18 3.337e-02
## 2.26 15 2.781e-02
## 2.27 12 2.225e-02
## 2.28 20 3.708e-02
## 2.29 17 3.152e-02
## 2.3 21 3.893e-02
## 2.31 13 2.410e-02
## 2.32 16 2.966e-02
## 2.33 9 1.669e-02
## 2.34 5 9.270e-03
## 2.35 7 1.298e-02
## 2.36 8 1.483e-02
## 2.37 6 1.112e-02
## 2.38 8 1.483e-02
## 2.39 7 1.298e-02
## 2.4 13 2.410e-02
## 2.41 5 9.270e-03
## 2.42 8 1.483e-02
## 2.43 6 1.112e-02
## 2.44 4 7.416e-03
## 2.45 4 7.416e-03
## 2.46 3 5.562e-03
## 2.47 3 5.562e-03
## 2.48 9 1.669e-02
## 2.49 3 5.562e-03
## 2.5 17 3.152e-02
## 2.51 17 3.152e-02
## 2.52 9 1.669e-02
## 2.53 8 1.483e-02
## 2.54 9 1.669e-02
## 2.55 3 5.562e-03
## 2.56 3 5.562e-03
## 2.57 3 5.562e-03
## 2.58 3 5.562e-03
## 2.59 1 1.854e-03
## 2.6 3 5.562e-03
## 2.61 3 5.562e-03
## 2.63 3 5.562e-03
## 2.64 1 1.854e-03
## 2.65 1 1.854e-03
## 2.66 3 5.562e-03
## 2.67 1 1.854e-03
## 2.68 2 3.708e-03
## 2.7 1 1.854e-03
## 2.71 1 1.854e-03
## 2.72 3 5.562e-03
## 2.74 3 5.562e-03
## 2.75 2 3.708e-03
## 2.77 1 1.854e-03
## 2.8 2 3.708e-03
## 3 8 1.483e-02
## 3.01 14 2.595e-02
## 3.02 1 1.854e-03
## 3.04 2 3.708e-03
## 3.05 1 1.854e-03
## 3.11 1 1.854e-03
## 3.22 1 1.854e-03
## 3.24 1 1.854e-03
## 3.4 1 1.854e-03
## 3.5 1 1.854e-03
## 3.51 1 1.854e-03
## 3.65 1 1.854e-03
## 3.67 1 1.854e-03
## 4 1 1.854e-03
## 4.01 2 3.708e-03
## 4.13 1 1.854e-03
## 4.5 1 1.854e-03
## 5.01 1 1.854e-03
## Total 53940 1.000e+02
# Can you build a histogram for cut? Why not?
# Look closer at smaller carat diamonds (left portion from previous histogram)
smaller <- diamonds %>%
  filter(carat < 3)
# Set small binwidth
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1) + theme_bw()
# Frequency polygon
ggplot(data = smaller, mapping = aes(x = carat, color = cut)) +
  geom_freqpoly(binwidth = 0.1) + theme_bw() +
  scale_color_brewer(palette = "Spectral")
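The two questions raised in the script above (why two histograms of the same variable look different, and why a histogram cannot be built for cut) both come down to binning. The base R sketch below illustrates this with the built-in `faithful` data and a small made-up factor as stand-ins; both are assumptions for illustration, chosen so the example runs without extra packages.

```r
x <- faithful$eruptions              # built-in numeric variable, used as a stand-in

# Default breaks: hist() picks the bin edges itself when breaks is not given
h_default <- hist(x, plot = FALSE)

# Fixed bin width of 0.5, with break points chosen by hand
h_wide <- hist(x, breaks = seq(1.5, 5.5, by = 0.5), plot = FALSE)

h_default$breaks                     # bin edges picked automatically
h_wide$breaks                        # bin edges every 0.5
sum(h_default$counts) == sum(h_wide$counts)  # same data, different binning

# A histogram needs a numeric variable; a factor has no numeric axis to bin
cut_like <- factor(c("Fair", "Good", "Ideal", "Good"))
hist_error <- tryCatch(hist(cut_like, plot = FALSE),
                       error = function(e) conditionMessage(e))
hist_error                           # the error message from hist()
table(cut_like)                      # count categories instead (bar chart data)
```

The same observations fall into different bins depending on bin width and edge placement, so the shape of a histogram is partly a choice made by the analyst. For a categorical variable, counts per category shown as a bar chart take the place of a histogram.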
## Gentle Machine Learning
## Scatter plot matrix
## Extracted from Alexander C. Tan, Karl Ho & Cal Clark. 2020. The political
## economy of Taiwan’s regional relations, Asian Affairs: An American Review
# Check packages
doInstall <- TRUE # Set to FALSE to skip package installation
toInstall <- c("openxlsx", "tidyverse", "RColorBrewer", "GGally")
if (doInstall) {
  install.packages(toInstall, repos = "http://cran.us.r-project.org")
}
lapply(toInstall, require, character.only = TRUE) # call into library
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] TRUE
# Import data from GitHub
imfgrowth = openxlsx::read.xlsx("https://github.com/datageneration/gentlemachinelearning/raw/master/data/imfgrowth.xlsx")
attach(imfgrowth)
# imfgrowth = rename(imfgrowth, US = "United.States")
imf8019 = imfgrowth[which(imfgrowth$Year<2020),]
imf8019$decade = as.factor(imf8019$decade) # Change decade into factor
attach(imf8019)
# Create group for comparison
# NSP is the group of countries targeted by Taiwan in its New Southbound Policy (2016)
# Select columns by name rather than relying on attach()
tcuan <- imf8019[, c("China", "Taiwan", "United.States", "NSP", "ASEAN")]
# Pairwise scatterplot matrix
# Specifying font, subject to font availability on the local computer
ggpairs(tcuan) + theme_bw() +
  theme(text = element_text(size = 12, family = "Palatino"))
## Bivariate scatterplots with regression line
ggduo(
  tcuan,
  types = list(continuous = "smooth_lm")) + theme_bw()
## Scatter plot matrix
## Choose variables to be plotted
ggscatmat(imf8019, columns = 20:24, alpha = 0.8) +
  theme_bw() +
  theme(text = element_text(size = 12, family = "Palatino")) +
  labs(y = "Economic growth, 1980-2018", x = "Economic growth, 1980-2018") +
  scale_fill_brewer(palette = "Set1") + scale_color_brewer(palette = "Set1")
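For continuous variables, `ggpairs()` by default reports pairwise correlations alongside the scatter plots. The numbers behind such a matrix can be checked directly with base R. The sketch below uses the built-in `iris` data as a stand-in (an assumption, since the IMF growth file has to be downloaded) and `pairs()` as a package-free counterpart of the scatter plot matrix.

```r
num_vars <- iris[, 1:4]              # stand-in data: four numeric columns

# Pearson correlations between every pair of columns, rounded for reading
cor_mat <- round(cor(num_vars), 2)
cor_mat

# Base R scatter plot matrix, with points colored by group
pairs(num_vars, col = as.integer(iris$Species), pch = 19)
```

Reading the correlation table next to the plots helps separate relationships that are genuinely linear from ones that only look strong because of a few extreme points.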
## Gentle Machine Learning
## Scatter plot matrix
## Adapted from example in Unwin, Antony. 2015. Graphical data analysis with R. Vol. 27. CRC Press.
#doInstall <- TRUE # For checking if package is installed
#toInstall <- c("pgmm", "tidyverse", "pdp", "GGally", "grid", "gridExtra")
#if(doInstall){install.packages(toInstall, repos = "http://cran.us.r-project.org")}
# lapply(toInstall, library, character.only = TRUE) # call into library
library(pgmm)
library(tidyverse)
library(pdp)
library(GGally)
library(grid)
library(gridExtra)
# Load data
# Data on the chemical composition of coffee samples collected from around the
# world, comprising 43 samples from 29 countries. Each sample is either of the
# Arabica or Robusta variety. Twelve of the thirteen chemical constituents
# reported in the study are given.
# The omitted variable is total chlorogenic acid; it is generally the sum of
# the chlorogenic, neochlorogenic and isochlorogenic acid values.
data(coffee, package = "pgmm")
# Add a readable label for the variety code (1 = Arabica, otherwise Robusta)
coffee <- within(coffee, Type <- ifelse(Variety == 1,
                                        "Arabica", "Robusta"))
names(coffee) <- abbreviate(names(coffee), 8)
# Bar chart: number of samples per coffee type
a <- ggplot(coffee, aes(x = Type)) + geom_bar(aes(fill = Type)) +
  scale_fill_manual(values = c("grey70", "red")) +
  guides(fill = "none") + ylab("") + # fill = FALSE is deprecated in recent ggplot2
  theme_bw() +
  theme(text = element_text(family = "Palatino"))
# Scatter plot: fat against caffeine, colored by type
b <- ggplot(coffee, aes(x = Fat, y = Caffine, colour = Type)) +
  geom_point(size = 2) +
  scale_colour_manual(values = c("grey70", "red")) +
  theme_bw() +
  theme(text = element_text(family = "Palatino"))
# Parallel coordinate plot of the twelve chemical constituents
c <- ggparcoord(coffee[order(coffee$Type),], columns = 3:14,
                groupColumn = "Type", scale = "uniminmax",
                mapping = aes(size = 1), splineFactor = TRUE) +
  xlab("") + ylab("") +
  scale_colour_manual(values = c("grey", "red")) +
  theme_bw() + # apply the complete theme first so it does not reset later settings
  theme(legend.position = "none",
        text = element_text(family = "Palatino"))
# Combine into one page using grid
grid.arrange(arrangeGrob(a, b, ncol = 2, widths = c(1, 2)),
             c, nrow = 2)
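The parallel coordinate plot above uses `scale = "uniminmax"`, which rescales each variable separately so that its minimum becomes 0 and its maximum becomes 1. A minimal base R sketch of that rescaling follows. The helper name `min_max` is hypothetical and the built-in `iris` data is a stand-in, so this illustrates the idea rather than GGally's internal code.

```r
# Rescale one numeric vector so its minimum maps to 0 and its maximum to 1
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Apply the rescaling column by column
scaled <- as.data.frame(lapply(iris[, 1:4], min_max))
sapply(scaled, range)                # every column now spans 0 to 1
```

Putting all variables on a common 0 to 1 range lets constituents measured in very different units share one vertical axis, at the cost of hiding their original magnitudes.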
References
Unwin, Antony. 2015. Graphical data analysis with R. Boca Raton, FL: CRC Press.
Grolemund, Garrett, and Hadley Wickham. 2018. R for data science. (https://r4ds.had.co.nz/).